Introduction

This tutorial will walk you through the entire data science pipeline, starting with data collection and processing, moving on to exploratory data analysis and visualization, then using hypothesis testing and machine learning to provide analysis, and closing with the insights learned along the way. The focus will be on data processing and analysis through visualizations created with the Pyplot and Plotly libraries.

The Data Lifecycle

Loading data

The data set we will analyze is the Homicide Reports (1980-2014) from the FBI and FOIA, which can be downloaded here. We chose this data because it contains many variables, allowing a variety of analyses from different angles. In addition, by analyzing the homicide reports and looking at the number of cases, we hope to find trends and become more aware of how serious the problem can be.

In [2]:
import pandas as pd
import numpy as np

data = pd.read_csv("database.csv", low_memory=False)
df = pd.DataFrame(data)
#replace 'Unknown' values with np.nan
df.replace('Unknown', np.nan, inplace=True)
df.head()
Out[2]:
Record ID Agency Code Agency Name Agency Type City State Year Month Incident Crime Type ... Victim Ethnicity Perpetrator Sex Perpetrator Age Perpetrator Race Perpetrator Ethnicity Relationship Weapon Victim Count Perpetrator Count Record Source
0 1 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 January 1 Murder or Manslaughter ... NaN Male 15 Native American/Alaska Native NaN Acquaintance Blunt Object 0 0 FBI
1 2 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 1 Murder or Manslaughter ... NaN Male 42 White NaN Acquaintance Strangulation 0 0 FBI
2 3 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 2 Murder or Manslaughter ... NaN NaN 0 NaN NaN NaN NaN 0 0 FBI
3 4 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 1 Murder or Manslaughter ... NaN Male 42 White NaN Acquaintance Strangulation 0 0 FBI
4 5 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 2 Murder or Manslaughter ... NaN NaN 0 NaN NaN NaN NaN 0 1 FBI

5 rows × 24 columns

Plotting and analyzing

For the first plot, we are interested in the number of cases each year from 1980 to 2014. To count the cases, instead of summing the Incident column directly, I used groupby's size() method to count the rows in each group, which should be slightly more accurate since two incidents can involve the same victim and perpetrator.
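As a minimal illustration of the point above, on toy data rather than the real reports, size() counts the rows in each year's group rather than summing the Incident values:

```python
import pandas as pd

# Toy frame (not the real dataset): the Incident column numbers incidents
# within a reporting period, so summing it would not give the case count,
# while groupby(...).size() simply counts the rows in each year.
toy = pd.DataFrame({'Year': [1980, 1980, 1981], 'Incident': [1, 2, 1]})
counts = toy.groupby('Year').size()
print(counts.to_dict())
```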

In [3]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
g = df.groupby('Year')
years = sorted(g.groups.keys())
size = g.size().values.ravel()
fig, ax = plt.subplots()
ax.plot(years, size, marker='.', linestyle='-', ms=5, color = 'purple', alpha = .5)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 1))
ax.set_xlabel('Year')
ax.set_ylabel('Number of Homicide')
ax.set_title("1980-2014 Number of Homicide by Year")
plt.xticks(rotation=90)
plt.show()
plt.close("all")

From the plot above we can immediately see a huge decline in the number of homicides in the late 1990s. Unfortunately, after some research there still seems to be no definitive answer for the cause of the decline, although there are articles discussing some of the leading guesses. The links are provided below:

Fitting a linear model

We can use sklearn to fit a linear regression model to the data in the graph above.

In [4]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
#fitting the regression model
x = years
y = size
x = np.reshape(x,(-1,1))
y = y.reshape(-1,1)
regr.fit(x, y)
fig, ax = plt.subplots()
ax.plot(years, size, marker='.', linestyle='None', ms=5, color = 'purple', alpha = .5)
ax.plot(years, regr.predict(x).ravel(), color='blue', alpha= .5)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 1))
ax.set_xlabel('Year')
ax.set_ylabel('Number of Homicide')
ax.set_title("1980-2014 Number of Homicide by Year")
plt.xticks(rotation=90)
plt.show()
plt.close("all")

We can see from the regression line that even though the number of homicides climbed in the 1980s and the early 2000s, the overall trend is decreasing as the years pass.
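To quantify that downward trend, we can read the slope of the least-squares fit directly. A sketch with made-up yearly counts (numpy's polyfit produces the same line sklearn's LinearRegression does for this one-feature case):

```python
import numpy as np

# Made-up counts falling by 2 per year, just to show how to read the slope
years = np.arange(1980, 1990)
counts = 100 - 2 * (years - 1980)

slope, intercept = np.polyfit(years, counts, deg=1)
print(slope)  # close to -2.0: the fitted line drops by about 2 cases per year
```

A negative slope on the real data is what tells us the overall trend is downward despite the local climbs.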

For the next plot, we would like to see the relationship between the victim and the perpetrator. The "Other" category includes some of the closer relationships, which are:

  Neighbor
  Boyfriend/Girlfriend
  Friend
  Family
  Common-Law Husband
  Common-Law Wife
  Stepdaughter
  Stepfather
  Stepmother
  Stepson
  Ex-Husband
  Ex-Wife
  Employee
  Employer
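The three-way bucketing described above can be sketched with pandas alone; this toy Series stands in for the real Relationship column:

```python
import pandas as pd

# Hypothetical sample of relationship labels (not the real data)
rel = pd.Series(['Stranger', 'Acquaintance', 'Friend', 'Ex-Wife', 'Stranger'])

# Everything that is neither 'Stranger' nor 'Acquaintance' becomes 'Other'
bucket = rel.where(rel.isin(['Stranger', 'Acquaintance']), 'Other')
print(bucket.value_counts().to_dict())
```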
In [5]:
g1 = df.groupby(['Year','Relationship'])
g1 = g1.size()

#Create a new dataframe with the year as index and three columns indicating the number of 'Stranger', 'Acquaintance', and 'Other' cases
df2 = pd.DataFrame(index = years, columns = ['Stranger', 'Acquaintance', 'Other'])
stranger = []
acq= []
other = []
y = 1980
s = 0
#counting the total number of specific relationship for each year
for index, series in g1.iteritems():
    
    if(index[1] == 'Stranger'):
        stranger.append(series)
    elif (index[1] == 'Acquaintance'):
        acq.append(series)
    else:
        s += series
    if(y == index[0]-1):
        y += 1
        other.append(s)
        s = 0
other.append(s)  

df2['Stranger'] = stranger
df2['Acquaintance'] =  acq
df2['Other'] = other

Bar graphs

For this plot I used pandas' DataFrame plotting, which wraps pyplot but is much simpler if you already have an organized dataframe. You can learn more here

In [6]:
f, ax1 = plt.subplots(1, figsize=(20,6))
ax1.set_xlabel('Year')
ax1.set_ylabel('Number of Homicide')
ax1.set_title("Relationship Between Victim and Perpetrator")
df2.plot.bar(stacked=True,ax=ax1, alpha = .5, width = .8, color =['#F4561D','#F1911E','#F1BD1A'])
plt.show()

From the bar graph above, we can see the number of homicides decrease in all three relationship categories we analyzed. However, while the case numbers start off fairly close for strangers and acquaintances, the decreasing trend is more pronounced for acquaintances than for strangers. We can also see that the number of homicides between the closer relationships marked 'Other' did not decrease nearly as much as the other two; it even comes very close to the acquaintance numbers in the 2000s.

Having seen the number of homicides by the relationship between victim and perpetrator, we might also want to know the sexes of the victims and perpetrators, so we have included graphs of the number of homicide victims and perpetrators by sex.

In [7]:
g2 = df.groupby(['Year','Victim Sex'])
#reshape the pandas series into one column of # of female victims and one column of # of male victims
g2 = g2.size().values.reshape(35,2)
df3 = pd.DataFrame(index = years, columns =['#Female Victim','#Male Victim'], data=g2)

f, ax2 = plt.subplots(1, figsize=(20,6))
ax2.set_xlabel('Year')
ax2.set_ylabel('Number of Victim')
ax2.set_title("Sex of Homicide Victim")
df3.plot.bar(ax=ax2, color=['r','b'],alpha=0.5, width=0.8)

g3 = df.groupby(['Year','Perpetrator Sex'])
g3 = g3.size().values.reshape(35,2)
df4 = pd.DataFrame(index = years, columns =['#Female Perpetrator','#Male Perpetrator'], data=g3)

f, ax3 = plt.subplots(1, figsize=(20,6))
ax3.set_xlabel('Year')
ax3.set_ylabel('Number of Perpetrator')
ax3.set_title("Sex of Homicide Perpetrator")
df4.plot.bar(ax=ax3, color=['r','b'],alpha=0.5, width=0.8)
plt.show()

It might not be very surprising that the male numbers are much higher than the female numbers, but it is interesting that the trends for victims and perpetrators look almost identical. From the resulting graphs, we also noticed that the perpetrator counts are lower than the victim counts, because perpetrator information is missing for unsolved cases. So, for the next plot we will show the percentage of homicides solved each year.
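The per-year solved percentage is just the share of solved cases within each year's group. A sketch on toy data, assuming a Yes/No flag like the real Crime Solved column:

```python
import pandas as pd

# Toy solved/unsolved flags, assumed Yes/No as in the real column
toy = pd.DataFrame({'Year': [1980, 1980, 1980, 1981],
                    'Solved': ['Yes', 'No', 'Yes', 'Yes']})

# Fraction of 'Yes' per year, scaled to a percentage
pct = toy.groupby('Year')['Solved'].apply(lambda s: (s == 'Yes').mean() * 100)
print(pct.to_dict())
```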

In [8]:
g4 = df.groupby(['Year','Crime Solved'])
#reshape the pandas series into one column of # not solved and one column of # solved
g4 = g4.size().values.reshape(35,2)
df5 = pd.DataFrame(index = years, columns =['#Not Solved','#Solved'], data=g4)

#calculate the crime solve percentage
df5['Crime Solved %'] = (df5['#Solved']/(df5['#Solved']+df5['#Not Solved'])*100)
In [9]:
x = years
y = df5['Crime Solved %']
x = np.reshape(x, (-1,1))
y = y.values.reshape(-1,1)

regr = linear_model.LinearRegression()
#fitting the regression model
regr.fit(x, y)

fig, ax5 = plt.subplots()
ax5.plot(df5.index, df5['Crime Solved %'], marker='.', linestyle='None', ms=5, color = 'orange')
ax5.xaxis.set_ticks(np.arange(start, end, 1))
ax5.set_xlabel('Year')
ax5.set_ylabel('Percentage of Homicide Solved')
ax5.set_title("1980-2014 Percentage of Homicide Solved by Year")
# plot the regression line
ax5.plot(x.ravel(), regr.predict(x).ravel(), color='blue', alpha= .5)   
plt.xticks(rotation=90)
plt.show()

The result was a little unexpected for me. At first, I thought we would see a clear linear increase in the percentage of solved homicide cases, thanks to improving technology and lessons learned from previous cases. However, the regression line shows the percentage of solved cases actually decreasing over time. One reason could be that perpetrators also have access to knowledge and technology that make cases harder to solve.

Analyze by States

Choropleth map

The choropleth map is another powerful technique that provides strong visualization. Below we will use a choropleth map to show the total number of homicide cases from 1980 to 2014.

In [10]:
g5 = df.groupby('State')
g5 = g5.size()

# since we cannot show DC on the 50-state map, we add DC's count to Maryland, the state bordering DC
maryland_total = g5.get('District of Columbia') + g5.get('Maryland')
g5.set_value('Maryland',maryland_total)

# We need to translate the state names into state codes so the plotly map can process the data
code = ['AL','AK', 'AZ', 'AR','CA', 'CO','CT','DE','DC','FL',
    'GA', 'HI', 'ID','IL','IN','IA', 'KS','KY', 'LA', 'ME', 'MD','MA','MI',
     'MN', 'MS', 'MO','MT', 'NE', 'NV', 'NH','NJ','NM','NY', 'NC', 'ND', 'OH',
     'OK', 'OR','PA','RI','SC', 'SD','TN', 'TX','UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
g5.keys= code
In [11]:
import plotly
#initialize plotly offline mode so you do not need a plotly account
plotly.offline.init_notebook_mode()

#we created the purple color scale to use
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = g5.keys,
        z = g5.values,
        locationmode = 'USA-states',
    
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Number of cases")
        ) ]

layout = dict(
        title = '1980 - 2014 Number of Homicide by State',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )
    
fig = dict( data=data, layout=layout )
plotly.offline.iplot( fig )

From the result, we can hypothesize that states with higher populations also have higher numbers of cases. To test this hypothesis, we also gathered state population data from the Census. The data was assembled by manually collecting each state's population for every year from 1980 to 2014 and putting it in an Excel document.

In [12]:
popdf = pd.read_excel("state_population.xlsx")

#add an extra column with the number of cases, which we will use later
popdf['Cases']= g5.values 
#as with the previous data, we add DC's numbers into MD and take the average of each year's population
m = popdf.loc[popdf['State']=='MD'] 
d = popdf.loc[popdf['State']=='DC'] 
t = m.values+d.values
t = np.delete(t,0)
t = np.delete(t,35)
a = np.mean(t)
popdf.set_value(20,'Average', a)
popdf.head()
Out[12]:
State 1980 1981 1982 1983 1984 1985 1986 1987 1988 ... 2007 2008 2009 2010 2011 2012 2013 2014 Average Cases
0 AL 3893888 3918531 3925266 3934102 3951820 3972523 3991569 4015264 4023844 ... 4672840 4718206 4757938 4785298 4799918 4815960 4829479 4843214 5.217157e+06 11376
1 AK 401851 418491 449606 488417 513702 532495 544268 539309 541983 ... 680300 687455 698895 713985 722713 731089 736879 736705 6.055167e+05 1617
2 AZ 2718215 2810107 2889861 2968925 3067135 3183538 3308262 3437103 3535183 ... 6167681 6280362 6343154 6413737 6467163 6549634 6624617 6719993 4.689746e+06 12871
3 AR 2286435 2293201 2294257 2305761 2319768 2327046 2331984 2342355 2342656 ... 2848650 2874554 2896843 2921606 2939493 2950685 2958663 2966912 2.579688e+06 6947
4 CA 23667902 24285933 24820009 25360026 25844393 26441109 27102237 27777158 28464249 ... 36250311 36604337 36961229 37349363 37676861 38011074 38335203 38680810 3.210574e+07 99783

5 rows × 38 columns

In [13]:
plotly.offline.init_notebook_mode()
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = popdf['State'],
        z = popdf['Average'],
        locationmode = 'USA-states',
    
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Population")
        ) ]

layout = dict(
        title = '1980 - 2014 Average Population by State',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )
    
fig = dict( data=data, layout=layout )
plotly.offline.iplot( fig )

The result from the average population data agrees with our hypothesis and looks almost identical to the previous choropleth map.

To get a better view of the relationship between the number of homicides by state and the average population by state, we can use the linear model to draw a regression line again.

In [14]:
fig, ax6 = plt.subplots()
ax6.plot(popdf.Cases,popdf.Average, linestyle='None', marker='.' )
x = popdf.Cases
y = popdf.Average
x = x.values.reshape(-1,1)
y = y.values.reshape(-1,1)

regr = linear_model.LinearRegression()
regr.fit(x, y)

ax6.plot(x.ravel(), regr.predict(x).ravel(), color='blue', alpha= .5)  
#remove the auto offset and scientific notation for large numbers
ax6.ticklabel_format(useOffset=False, style='plain')
ax6.set_xlabel('Number of Homicide')
ax6.set_ylabel('Average State Population')
ax6.set_title("1980-2014 Total Number of Homicide vs. States Average Population")
plt.show()

The result clearly shows a positive relation between the number of homicides and the population: as the population increases, the number of homicide cases also increases.
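That positive relation can also be quantified with a Pearson correlation coefficient; a sketch on invented numbers (not the Census figures):

```python
import numpy as np

# Invented state populations and case counts, roughly proportional
population = np.array([1, 2, 3, 4, 5], dtype=float) * 1e6
cases = np.array([120, 250, 360, 480, 610], dtype=float)

# Pearson r near 1 indicates a strong positive linear relation
r = np.corrcoef(population, cases)[0, 1]
print(r)
```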

Motion Bubble Chart

The motion bubble chart is another visualization that helps us see how the data changes over time.

Deal with missing data

In [15]:
import queue 

dfState = df.groupby(['State','Year']).size()

#find and fill missing data, using a queue to check the year range for each state
years = queue.Queue()
#range is inclusive for the start values and exclusive for the end value
for j in range(1980,2015):
    years.put(j)
#iterate over the rows, find the years each state is missing, and fill them with np.nan
for i, row in dfState.iteritems(): 
    if(years.empty()):
        for j in range(1980,2015):
            years.put(j)
    y = years.get()
    if(type(i) != int):
        if(i[1] != y):
            for x in range(y, i[1]):
               
                dfState.loc[(i[0],x)] = np.nan
                y = years.get()
            

#transfer the pandas series to a dataframe
dataState = dfState.to_frame('Crime')

#making year and state columns
dataState = dataState.reset_index()
#sort dataframe first by year then by state
dataState.sort_values(by=['Year', 'State'], inplace=True)

#now we want to add the population data to our dataframe
#drop the unused columns
temp_pop = popdf.drop('Average',1)
temp_pop.drop('Cases', 1,inplace=True)
temp_pop.drop('State', 1,inplace=True)
temp_pop = temp_pop.transpose()
pop = temp_pop.as_matrix()

dataState['Population'] = pop.reshape(1785,1)
dataState['pop'] = dataState['Population']
#rearrange to use it in the bubble chart
dataState = dataState[['Year','pop','Crime','Population','State']]
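As a design note, the queue-based gap filling above can also be expressed with a MultiIndex reindex, which inserts NaN for every missing (State, Year) pair in one call; a sketch on a toy series:

```python
import pandas as pd

# Toy grouped counts with 1982 missing for 'AK'
counts = pd.Series(
    [3, 5, 2],
    index=pd.MultiIndex.from_tuples(
        [('AK', 1980), ('AK', 1981), ('AK', 1983)], names=['State', 'Year']))

# Reindex against the full State x Year grid; absent pairs become NaN
full = pd.MultiIndex.from_product([['AK'], range(1980, 1984)],
                                  names=['State', 'Year'])
counts = counts.reindex(full)
print(int(counts.isna().sum()))  # 1 missing year was filled with NaN
```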

Finally, we can import motionchart and pass our dataframe to the motion chart. The bubble size will be determined by the state's population. The y-axis of the motion chart will be the population, the x-axis will be the number of crimes, and the chart will animate over the years.

In [44]:
import os
from IPython.display import display, HTML, IFrame
def to_notebook1(width = 900, height = 700): 
        display(IFrame(src="temp.html", width = width, height = height))
In [45]:
from motionchart.motionchart import MotionChart, MotionChartDemo
mChart = MotionChart(df=dataState, title = "Crime Cases by States")
mChart.to_browser()
to_notebook1()